Purpose shapFlex logo

The purpose of shapFlex, short for Shapley flexibility, is to compute stochastic feature-level Shapley values for ensemble models using potentially different, high-dimensional input datasets. The main function in this package is shapFlex::shapFlex().


Benefits

  • Flexibility:
    • Shapley values can be estimated for ensembles of many machine learning models using a simple user-defined predict() wrapper function.
    • Shapley values can be estimated for a given feature if it appears in multiple datasets in a more elaborate ensemble model.
  • Speed:
    • The code itself hasn’t been optimized for speed. The speed advantage of shapFlex comes in the form of giving the user the ability to select 1 or more target features of interest and avoid having to compute Shapley values for all model features. This is especially useful in high-dimensional models as the computation of a Shapley value is exponential in the number of features.


Package Features

  1. Compute ensemble Shapley values for select features with shapFlex::shapFlex().

Background

As algorithmic decision-making continues its headlong rush into our everyday lives, interpretable machine learning has become increasingly relevant for society, machine/deep learning researchers, and data science practicioners alike (1). And while numerous statistical techniques and methodologies have been dedicated to making black-box models and the decisions that they produce more human-friendly, one method in particular, the Shapley value, has a number of desirable properties including its ability to unify several seemingly different methodologies (2).

Briefly, in the context of machine learning, a Shapley value represents the marginal or unique contribution of each model feature toward each prediction. It’s a hyper-local measure of feature importance for each individual or instance which gives a clear profile of how a prediction model is uniquely making predictions for a given instance. Shapley values are measured in the same units in which your model is making predictions: dollars, number of cats, risk, probability etc. The Shapley method of feature influence has a number of practical applications which include (a) detecting bias in machine learning models, (b) training and tuning models by assessing the Shapley values of influential observations, and (c) using the resulting ‘instance * feature’ matrix as an input for a variety of additional machine learning tasks such as clustering.

Below are several good resources to learn more about the features, benefits, and uses of Shapley values for interpreting black-box machine learning models:



Example

  • We’ll demonstrate how to use the shapFlex package with a worked example using the imports85 dataset from the randomForest package.

  • Our goal is to see how each of ~20 factors in a simple machine learning ensemble model uniquely influence the price of a car using approximate Shapley values.

  • The Shapley values will be plotted (a) across all investigated samples/cars to get a measure of global feature importance and (b) for select cars to get a measure of local feature importance.

Install shapFlex

devtools::install_github("nredell/shapFlex")
library(shapFlex)


Load packages and data

library(glmnet)
library(randomForest)
library(ggplot2)
library(plotly)
library(DT)

data("imports85", package = "randomForest")
data <- imports85

data <- data[, -2]  # this column has excessive missing data.
data <- data[complete.cases(data), ]
row.names(data) <- 1:nrow(data)  # re-index for simplicity.

DT::datatable(head(data, 5), options = list(scrollX = TRUE))


Train models

  • Outcome: price

  • Features: ~20

  • Sample size: ~200

  • Models: A LASSO from glmnet and a Random Forest from randomForest.
outcome_col <- which(names(data) == "price")
outcome_name <- names(data)[outcome_col]

model_formula <- formula(paste0(outcome_name,  "~ ."))

set.seed(224)
model_lasso <- glmnet::cv.glmnet(x = model.matrix(model_formula, data), 
                                 y = as.matrix(data[, outcome_col, drop = FALSE], ncol = 1))

set.seed(224)
model_rf <- randomForest::randomForest(formula = model_formula, data = data, ntree = 200)


Prepare shapFlex::shapFlex() Input

User-defined prediction function

  • Required: A user-defined wrapper function that takes the following positional arguments:
    • Position 1: A list of model objects for predicting on the input dataset.
    • Position 2: A data.frame of model features (the outcome column won’t be passed in shapFlex()).

      and returns:

    • A 1-column data.frame with a number of rows equal to the input data.

  • This function can be simple–like below–or contain meta-models for combining ensemble predictions, feature transformations, or model stacking etc.
# models[1] and models[2] could be hardcoded model_lasso and model_rf.
# In the next code block, see the `model_list` object for an example of the format of the 'models' argument.
predict_ensemble <- function(models, data, ...) {
  
  y_pred_lasso <- data.frame(predict(models[[1]], model.matrix(~ ., data)))  # LASSO
  y_pred_rf <- data.frame(predict(models[[2]], data))  # Random Forest
  
  data_pred <- dplyr::bind_cols(y_pred_lasso, y_pred_rf)
  
  data_pred <- data.frame("y_pred" = rowMeans(data_pred, na.rm = TRUE))
  
  return(data_pred)
}

Shapley sampling parameters

  • Below are some of the main arguments to shapFlex::shapFlex() with some context.

  • We’re going to explain all instances in our dataset using all features which is the default setting.

  • Refer to the–forthcoming–vignette to see how shapFlex::shapFlex() runtime and accuracy are affected by the number of model features and monte carlo sample size.

# A list of data.frame(s) of model features suitable for the user-defined predict function(s).
data_list <- list(data[, -(outcome_col), drop = FALSE])
# Dataset row numbers or indicies.
explain_instances <- 1:nrow(data)
# Are the instances to explain row numbers/indices or row names ('row_name') in the input data?
explain_instance_id <- "row_index"
# A list of of model objects. Nested lists of length(data_list) are needed if length(data_list) > 1.
model_list <- list(model_lasso, model_rf)
# A list of length 1 vectors with length(prediction_functions) == length(data_list).
predict_functions <- list("predict_ensemble")
# The number of randomly selected dataset rows used to calculate the feature-level Shapley values.
sample_size <- 100
# Number of cores to use in parallel::mclapply(); limited to 1 on Windows OS.
n_cores <- if (Sys.info()["sysname"] == "Windows") {1} else {parallel::detectCores() - 1}


shapFlex::shapFlex()

Explain with all model features

  • Because shapFlex::shapFlex() explains 1 dataset row/instance at a time, we’ll loop through our dataset and store the results in a list.

  • Let’s also record the runtime and compare it to the time it takes to explain with 1 feature.

explained_instances <- vector("list", length(explain_instances))

start <- Sys.time()
set.seed(224)
for (i in seq_along(explain_instances)) {
  
  explained_instances[[i]] <- shapFlex::shapFlex(data = data_list, 
                                                 explain_instance = explain_instances[i],  # loop
                                                 explain_instance_id = explain_instance_id,
                                                 models = model_list, 
                                                 predict_functions = predict_functions, 
                                                 sample_size = sample_size, 
                                                 n_cores = n_cores)
}
stop <- Sys.time()
list(stop - start, paste0("Cores = ", n_cores))
[[1]]
Time difference of 9.153347 mins

[[2]]
[1] "Cores = 1"


  • The output of shapFlex::shapFlex() is a data.frame with the following columns:
    • explained_instance: The row name or row number of the target instance (number/name depends on the ’explain_instance_id` argument).
    • feature_name: The feature name from names(data) or a subset of all features if using the ‘target_features’ argument. The intercept represents the average prediction in the input dataset–regardless of whether or not any models explicitly use intercepts.
    • feature_value: Taken from the input data. The intercept has an NA feature value. At present, only appears with 1 input dataset.
    • shap_effect: The average Shapley value across samples in ‘sample_size’.
    • shap_effect_sd: The standard deviation in Shapley values across samples in ‘sample_size’. At present, only appears with 1 input dataset.
data_shap <- dplyr::bind_rows(explained_instances)

DT::datatable(data_shap)

Global feature effects

  • Below are plots similar to partial dependence plots for categorical followed by numeric model features.
data_plot <- data_shap

factor_features <- names(data)[which(unlist(lapply(data, function(x) {methods::is(x, "factor")})))]

data_plot <- dplyr::filter(data_plot, feature_name %in% factor_features)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_boxplot()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
p


data_plot <- data_shap

numeric_features <- names(data)[which(unlist(lapply(data, function(x) {!methods::is(x, "factor")})))]

data_plot <- dplyr::filter(data_plot, feature_name %in% numeric_features)

data_plot$feature_value <- as.numeric(data_plot$feature_value)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_point(alpha = .25)
p <- p + geom_smooth()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
suppressWarnings(suppressMessages(print(p)))


Explain with 1 model feature

  • Now, let’s say we’re really interested in the marginal effect of horsepower on car prices but don’t have the time to compute Shapley values for all features.

  • In this case, we’ll set the ‘target_features’ argument and examine its effect.

  • The Shapley values are still calculated in the usual way, and the results are identical if a seed is set, but we get results 14 times faster.

explained_instances <- vector("list", length(explain_instances))

target_features <- list("horsepower")

start <- Sys.time()
set.seed(224)
for (i in seq_along(explain_instances)) {

  explained_instances[[i]] <- shapFlex::shapFlex(data = data_list, 
                                                 explain_instance = explain_instances[i],  # loop
                                                 explain_instance_id = explain_instance_id,
                                                 models = model_list, 
                                                 predict_functions = predict_functions, 
                                                 target_features = target_features, 
                                                 sample_size = sample_size, 
                                                 n_cores = n_cores)
}
stop <- Sys.time()
list(stop - start, paste0("Cores = ", n_cores))
[[1]]
Time difference of 40.19398 secs

[[2]]
[1] "Cores = 1"

Global feature effects

data_shap_var <- dplyr::bind_rows(explained_instances)

data_plot <- data_shap_var

numeric_features <- "horsepower"

data_plot <- dplyr::filter(data_plot, feature_name %in% numeric_features)

data_plot$feature_value <- as.numeric(data_plot$feature_value)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_point(alpha = .25)
p <- p + geom_smooth()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
suppressWarnings(suppressMessages(print(p)))


Feature subset consistency

  • Finally, we’ll check to see that the horsepower Shapley value estimates across our samples are identical when Shapley values are (a) computed for all features and (b) computed soley for horsepower.

  • As expected, the estimates are identical. If we are interested in focusing on horsepower and learning more about how the relationship between horsepower and a car’s price changes as (a) new models are added to the ensemble or (b) the model incorporates new features, we can do so without the overhead of computing Shapley values for all features.

horsepower_shap_all_features <- dplyr::filter(data_shap, feature_name == "horsepower") %>% 
  dplyr::select(shap_effect)
horsepower_shap_1_feature <- dplyr::filter(data_shap_var, feature_name == "horsepower") %>% 
  dplyr::select(shap_effect)

identical(horsepower_shap_all_features, horsepower_shap_1_feature)
[1] TRUE

Local feature importance

  • Because Shapley values give us insight into how each feature uniquely affects each instance’s prediction, we’ll produce a plot of the feature effects profile for a handful of instances.

  • Below is a plot of how each of the modeled features from our ensemble model is impacting predicted prices for the 6 Audis in the dataset (rows 4 through 9 in our filtered dataset).

  • Hover over the bars to see the feature values.

data_plot <- data_shap

audis <- which(data$make == "audi")

data_plot <- dplyr::filter(data_plot, explained_instance %in% audis)

data_plot$bar_color <- ifelse(data_plot$shap_effect >= 0, "increase", "decrease")

p <- ggplot(data_plot, aes(x = feature_name, y = shap_effect, fill = bar_color, 
                           group = explained_instance, text = paste0(feature_name, ": ", feature_value)))
p <- p + geom_bar(stat = "identity", show.legend = FALSE)
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ explained_instance)
p <- p + theme_bw() + theme(legend.position = "none") + coord_flip() + xlab(NULL) + ylab("Shapley value")
plotly::ggplotly(p, tooltip = "text")

---
title: "package::shapFlex Overview"
date: "`r lubridate::today()`"
author: "Nick Redell, nickredell@hotmail.com"
output:
  html_notebook:
    code_folding: show
    toc: yes
    toc_float: yes
---

***

# Purpose <img src="shapFlex_logo.png" alt="shapFlex logo" align="right" width="150" height="150" style="display: inline-block; padding: 5px;">

The purpose of `shapFlex`, short for Shapley flexibility, is to compute stochastic feature-level Shapley values 
for ensemble models using potentially different, high-dimensional input datasets. The main function in this package is `shapFlex::shapFlex()`.

***

# Benefits

* **Flexibility**: 
    + Shapley values can be estimated for ensembles of <u>many machine learning models</u> using a simple user-defined predict() wrapper function.
    + Shapley values can be estimated for a given feature if it appears in <u>multiple datasets</u> in a more elaborate ensemble model.
<br>
* **Speed**:
    + The code itself hasn't been optimized for speed. The speed advantage of `shapFlex` comes in the form of giving the user the ability 
    to <u>select 1 or more target features of interest</u> and avoid having to compute Shapley values for all model features. This is especially 
    useful in high-dimensional models as the computation of a Shapley value is exponential in the number of features.

***

<br>

# Package Features

1. Compute **ensemble Shapley values for select features** with `shapFlex::shapFlex()`.

***

# Background

As algorithmic decision-making continues its headlong rush into our everyday lives, interpretable 
machine learning has become increasingly relevant for society, machine/deep learning researchers, 
and data science practicioners alike ([1](https://arxiv.org/abs/1702.08608)). And while numerous statistical techniques and 
methodologies have been dedicated to making black-box models and the decisions that they 
produce more human-friendly, one method in particular, the Shapley value, has a number of desirable 
properties including its ability to unify several seemingly different methodologies 
([2](http://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions.pdf)). 

Briefly, in the context of machine learning, a Shapley value represents the marginal or unique contribution 
of each model feature toward each prediction. It's a <b>hyper-local measure of feature importance</b> for each 
individual or instance which gives a clear profile of how a prediction model is uniquely making 
predictions for a given instance. Shapley values are measured in the same units in which your model is 
making predictions: dollars, number of cats, risk, probability etc. The Shapley method of feature influence 
has a number of practical applications which include (a) detecting bias in machine learning models, (b) training and
tuning models by assessing the Shapley values of influential observations, and (c) using the 
resulting 'instance * feature' matrix as an input for a variety of additional machine learning tasks such as clustering.

Below are several good resources to learn more about the features, benefits, and uses of Shapley values for 
interpreting black-box machine learning models:

* [Interpretable Machine Learning by Christoph Molnar](https://christophm.github.io/interpretable-ml-book/shapley.html)
* [Python shap package by Scott Lundberg](https://github.com/slundberg/shap)
* [Blog post from the author of shap](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27)

***

```{r, include = FALSE}
knitr::opts_chunk$set(fig.width = 9, fig.height = 6)
```

```{r, include = FALSE}
devtools::load_all(export_all = FALSE)
```

<br>

# Example

* We'll demonstrate how to use the `shapFlex` package with a worked example using the `imports85` dataset from the `randomForest` package.
<p>
* Our goal is to see how each of ~20 factors in a simple machine learning ensemble model uniquely 
influence the price of a car using approximate Shapley values. 
<p>
* The Shapley values will be plotted (a) across all investigated samples/cars to get a measure of 
**global feature importance** and (b) for select cars to get a measure of **local feature importance**.


## Install shapFlex

```{r, eval = FALSE}
devtools::install_github("nredell/shapFlex")
library(shapFlex)
```

<br>

## Load packages and data

```{r, warning = FALSE, message = FALSE}
library(glmnet)
library(randomForest)
library(ggplot2)
library(plotly)
library(DT)

data("imports85", package = "randomForest")
data <- imports85

data <- data[, -2]  # this column has excessive missing data.
data <- data[complete.cases(data), ]
row.names(data) <- 1:nrow(data)  # re-index for simplicity.

DT::datatable(head(data, 5), options = list(scrollX = TRUE))
```

***

<br>

## Train models

* **Outcome**: price
<p>
* **Features**: ~20
<p>
* **Sample size**: ~200
<p>
* **Models**: A LASSO from `glmnet` and a Random Forest from `randomForest`.

```{r}
outcome_col <- which(names(data) == "price")
outcome_name <- names(data)[outcome_col]

model_formula <- formula(paste0(outcome_name,  "~ ."))

set.seed(224)
model_lasso <- glmnet::cv.glmnet(x = model.matrix(model_formula, data), 
                                 y = as.matrix(data[, outcome_col, drop = FALSE], ncol = 1))

set.seed(224)
model_rf <- randomForest::randomForest(formula = model_formula, data = data, ntree = 200)
```

***

<br>

## Prepare shapFlex::shapFlex() Input

### User-defined prediction function

* Required: A user-defined wrapper function that takes the following **positional arguments**:
    + **Position 1**: A list of model objects for predicting on the input dataset.
    + **Position 2**: A data.frame of model features (the outcome column won't be passed in `shapFlex()`).  
<br>
and **returns**:
<br>
<br>
    + A 1-column data.frame with a number of rows equal to the input data.
<p>
* This function can be simple--like below--or contain meta-models for combining ensemble predictions, 
feature transformations, or model stacking etc.

```{r}
# models[[1]] and models[[2]] could be hardcoded model_lasso and model_rf.
# In the next code block, see the `model_list` object for an example of the format of the 'models' argument.
predict_ensemble <- function(models, data, ...) {
  
  y_pred_lasso <- data.frame(predict(models[[1]], model.matrix(~ ., data)))  # LASSO
  y_pred_rf <- data.frame(predict(models[[2]], data))  # Random Forest
  
  data_pred <- dplyr::bind_cols(y_pred_lasso, y_pred_rf)
  
  data_pred <- data.frame("y_pred" = rowMeans(data_pred, na.rm = TRUE))
  
  return(data_pred)
}
```


### Shapley sampling parameters

* Below are some of the main arguments to `shapFlex::shapFlex()` with some context.

* We're going to explain all instances in our dataset using all features which is the default 
setting.

* Refer to the--forthcoming--vignette to see how `shapFlex::shapFlex()` runtime and accuracy 
are affected by the number of model features and monte carlo sample size.

```{r}
# A list of data.frame(s) of model features suitable for the user-defined predict function(s).
data_list <- list(data[, -(outcome_col), drop = FALSE])
# Dataset row numbers or indicies.
explain_instances <- 1:nrow(data)
# Are the instances to explain row numbers/indices or row names ('row_name') in the input data?
explain_instance_id <- "row_index"
# A list of of model objects. Nested lists of length(data_list) are needed if length(data_list) > 1.
model_list <- list(model_lasso, model_rf)
# A list of length 1 vectors with length(prediction_functions) == length(data_list).
predict_functions <- list("predict_ensemble")
# The number of randomly selected dataset rows used to calculate the feature-level Shapley values.
sample_size <- 100
# Number of cores to use in parallel::mclapply(); limited to 1 on Windows OS.
n_cores <- if (Sys.info()["sysname"] == "Windows") {1} else {parallel::detectCores() - 1}
```

***

<br>

## shapFlex::shapFlex()

### Explain with all model features

* Because `shapFlex::shapFlex()` explains 1 dataset row/instance at a time, we'll loop through 
our dataset and store the results in a list.

* Let's also record the runtime and compare it to the time it takes to explain with 1 feature.

```{r}
explained_instances <- vector("list", length(explain_instances))

start <- Sys.time()
set.seed(224)
for (i in seq_along(explain_instances)) {
  
  explained_instances[[i]] <- shapFlex::shapFlex(data = data_list, 
                                                 explain_instance = explain_instances[i],  # loop
                                                 explain_instance_id = explain_instance_id,
                                                 models = model_list, 
                                                 predict_functions = predict_functions, 
                                                 sample_size = sample_size, 
                                                 n_cores = n_cores)
}
stop <- Sys.time()
list(stop - start, paste0("Cores = ", n_cores))
```

***

<br>

* The output of `shapFlex::shapFlex()` is a data.frame with the following columns:
    + **explained_instance**: The row name or row number of the target instance (number/name depends on the 'explain_instance_id` argument).
    + **feature_name**: The feature name from names(data) or a subset of all features if using the 'target_features' argument. The 
    intercept represents the average prediction in the input dataset--regardless of whether or not any models explicitly use intercepts.
    + **feature_value**: Taken from the input data. The intercept has an NA feature value. At present, only appears with 1 input dataset.
    + **shap_effect**: The average Shapley value across samples in 'sample_size'.
    + **shap_effect_sd**: The standard deviation in Shapley values across samples in 'sample_size'. At present, only appears with 1 input dataset.

```{r}
data_shap <- dplyr::bind_rows(explained_instances)

DT::datatable(data_shap)
```

***

#### Global feature effects

* Below are plots similar to partial dependence plots for categorical followed by numeric model features.

```{r}
data_plot <- data_shap

factor_features <- names(data)[which(unlist(lapply(data, function(x) {methods::is(x, "factor")})))]

data_plot <- dplyr::filter(data_plot, feature_name %in% factor_features)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_boxplot()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
p
```

***

```{r, results = "hide"}
data_plot <- data_shap

numeric_features <- names(data)[which(unlist(lapply(data, function(x) {!methods::is(x, "factor")})))]

data_plot <- dplyr::filter(data_plot, feature_name %in% numeric_features)

data_plot$feature_value <- as.numeric(data_plot$feature_value)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_point(alpha = .25)
p <- p + geom_smooth()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
suppressWarnings(suppressMessages(print(p)))
```

***

### Explain with 1 model feature

* Now, let's say we're really interested in the marginal effect of horsepower on car prices but don't 
have the time to compute Shapley values for all features.

* In this case, we'll set the 'target_features' argument and examine its effect.

* The Shapley values are still calculated in the usual way, and the **results are identical** if a seed is set, 
but we **get results 14 times faster**.

```{r}
explained_instances <- vector("list", length(explain_instances))

target_features <- list("horsepower")

start <- Sys.time()
set.seed(224)
for (i in seq_along(explain_instances)) {

  explained_instances[[i]] <- shapFlex::shapFlex(data = data_list, 
                                                 explain_instance = explain_instances[i],  # loop
                                                 explain_instance_id = explain_instance_id,
                                                 models = model_list, 
                                                 predict_functions = predict_functions, 
                                                 target_features = target_features, 
                                                 sample_size = sample_size, 
                                                 n_cores = n_cores)
}
stop <- Sys.time()
list(stop - start, paste0("Cores = ", n_cores))
```

***

#### Global feature effects

```{r, results = "hide"}
data_shap_var <- dplyr::bind_rows(explained_instances)

data_plot <- data_shap_var

numeric_features <- "horsepower"

data_plot <- dplyr::filter(data_plot, feature_name %in% numeric_features)

data_plot$feature_value <- as.numeric(data_plot$feature_value)

p <- ggplot(data_plot, aes(feature_value, shap_effect))
p <- p + geom_point(alpha = .25)
p <- p + geom_smooth()
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ feature_name, scales = "free")
p <- p + theme_bw() + xlab(NULL) + ylab("Shapley value (0 is the average prediction)")
suppressWarnings(suppressMessages(print(p)))
```

***

### Feature subset consistency

* Finally, we'll check to see that the horsepower Shapley value estimates across our samples are 
identical when Shapley values are (a) computed for all features and (b) computed soley for horsepower.

* As expected, **the estimates are identical**. If we are interested in focusing on horsepower and 
learning more about how the relationship between horsepower and a car's price changes as (a) new models 
are added to the ensemble or (b) the model incorporates new features, we can do so without the 
overhead of computing Shapley values for all features.

```{r}
horsepower_shap_all_features <- dplyr::filter(data_shap, feature_name == "horsepower") %>% 
  dplyr::select(shap_effect)
horsepower_shap_1_feature <- dplyr::filter(data_shap_var, feature_name == "horsepower") %>% 
  dplyr::select(shap_effect)

identical(horsepower_shap_all_features, horsepower_shap_1_feature)
```

***

### Local feature importance

* Because Shapley values give us insight into how each feature uniquely affects each instance's prediction, 
we'll produce a plot of the feature effects profile for a handful of instances.

* Below is a plot of how each of the modeled features from our ensemble model is impacting predicted prices 
for the 6 Audis in the dataset (rows 4 through 9 in our filtered dataset).

* Hover over the bars to see the feature values.

```{r, fig.width = 8}
data_plot <- data_shap

audis <- which(data$make == "audi")

data_plot <- dplyr::filter(data_plot, explained_instance %in% audis)

data_plot$bar_color <- ifelse(data_plot$shap_effect >= 0, "increase", "decrease")

p <- ggplot(data_plot, aes(x = feature_name, y = shap_effect, fill = bar_color, 
                           group = explained_instance, text = paste0(feature_name, ": ", feature_value)))
p <- p + geom_bar(stat = "identity", show.legend = FALSE)
p <- p + scale_y_continuous(label = scales::dollar)
p <- p + facet_wrap(~ explained_instance)
p <- p + theme_bw() + theme(legend.position = "none") + coord_flip() + xlab(NULL) + ylab("Shapley value")
plotly::ggplotly(p, tooltip = "text")
```

***

